Case Study 4: Bankruptcy Detection

Team: Ana Glaser, Jake Harrison, Rob Burigo, and Yvan Sojdehei

Business Understanding

To minimize losses for the bank, we need to predict the likelihood that a company will go bankrupt in the future so that the bank can divest its investments in that company before the bankruptcy occurs.

Modeling Preparations

Methods:

The team is using both Random Forest and XGBoost. XGBoost minimizes error sequentially, and given our unbalanced target class we expected it to perform well relative to Random Forest. In preparation for modeling, we loaded all of the .arff files into one dataframe and decoded the class attribute, which was stored in byte format. The data was cleansed and scaled prior to modeling.

Evaluation Metrics:

The use case requires that accuracy be reported as a metric; however, due to the unbalanced nature of the target class, we will also measure precision, recall, and weighted F1 as truer measures of the model's ability to predict bankruptcy.

Importing Packages

In [1]:
import os
import pandas as pd
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from scipy.io import arff
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier 
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from yellowbrick.model_selection import FeatureImportances
import seaborn as sns
from sklearn.metrics import (confusion_matrix, recall_score, precision_score, f1_score,
                             classification_report, roc_auc_score, accuracy_score,
                             average_precision_score, precision_recall_curve,
                             plot_precision_recall_curve)
import warnings
warnings.filterwarnings('ignore')

Data Evaluation / Engineering

Data files were loaded and consolidated. Every column contained null values; these were imputed by replacing all nulls with zero.

Importing data: all .arff files were consolidated into a single dataframe.

In [2]:
data_file = os.listdir("./data/")
data_file
Out[2]:
['1year.arff', '2year.arff', '3year.arff', '4year.arff', '5year.arff']
In [3]:
my_data = pd.DataFrame()
for i in data_file:
    path = "./data/" + i
    print(path)
    data = arff.loadarff(path)         # load one .arff file
    data_df = pd.DataFrame(data[0])    # convert its records to a dataframe
    my_data = my_data.append(data_df)  # append to the consolidated dataframe

my_data['class'] = my_data['class'].str.decode('utf-8')  # decode the byte-encoded target
./data/1year.arff
./data/2year.arff
./data/3year.arff
./data/4year.arff
./data/5year.arff
In [4]:
my_data = my_data.rename(columns={'Attr1': 'net profit / total assets',
                                  'Attr2': 'total liabilities / total assets',
                                  'Attr3': 'working capital / total assets',
                                  'Attr4': 'current assets / short-term liabilities',
                                  'Attr5': '[(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365',
                                  'Attr6': 'retained earnings / total assets',
                                  'Attr7': 'EBIT / total assets',
                                  'Attr8': 'book value of equity / total liabilities',
                                  'Attr9': 'sales / total assets',
                                  'Attr10': 'equity / total assets',
                                  'Attr11': '(gross profit + extraordinary items + financial expenses) / total assets',
                                  'Attr12': 'gross profit / short-term liabilities',
                                  'Attr13': '(gross profit + depreciation) / sales',
                                  'Attr14': '(gross profit + interest) / total assets',
                                  'Attr15': '(total liabilities * 365) / (gross profit + depreciation)',
                                  'Attr16': '(gross profit + depreciation) / total liabilities',
                                  'Attr17': 'total assets / total liabilities',
                                  'Attr18': 'gross profit / total assets',
                                  'Attr19': 'gross profit / sales',
                                  'Attr20': '(inventory * 365) / sales',
                                  'Attr21': 'sales (n) / sales (n-1)',
                                  'Attr22': 'profit on operating activities / total assets',
                                  'Attr23': 'net profit / sales',
                                  'Attr24': 'gross profit (in 3 years) / total assets',
                                  'Attr25': '(equity - share capital) / total assets',
                                  'Attr26': '(net profit + depreciation) / total liabilities',
                                  'Attr27': 'profit on operating activities / financial expenses',
                                  'Attr28': 'working capital / fixed assets',
                                  'Attr29': 'logarithm of total assets',
                                  'Attr30': '(total liabilities - cash) / sales',
                                  'Attr31': '(gross profit + interest) / sales',
                                  'Attr32': '(current liabilities * 365) / cost of products sold',
                                  'Attr33': 'operating expenses / short-term liabilities',
                                  'Attr34': 'operating expenses / total liabilities',
                                  'Attr35': 'profit on sales / total assets',
                                  'Attr36': 'total sales / total assets',
                                  'Attr37': '(current assets - inventories) / long-term liabilities',
                                  'Attr38': 'constant capital / total assets',
                                  'Attr39': 'profit on sales / sales',
                                  'Attr40': '(current assets - inventory - receivables) / short-term liabilities',
                                  'Attr41': 'total liabilities / ((profit on operating activities + depreciation) * (12/365))',
                                  'Attr42': 'profit on operating activities / sales',
                                  'Attr43': 'rotation receivables + inventory turnover in days',
                                  'Attr44': '(receivables * 365) / sales',
                                  'Attr45': 'net profit / inventory',
                                  'Attr46': '(current assets - inventory) / short-term liabilities',
                                  'Attr47': '(inventory * 365) / cost of products sold',
                                  'Attr48': 'EBITDA (profit on operating activities - depreciation) / total assets',
                                  'Attr49': 'EBITDA (profit on operating activities - depreciation) / sales',
                                  'Attr50': 'current assets / total liabilities',
                                  'Attr51': 'short-term liabilities / total assets',
                                  'Attr52': '(short-term liabilities * 365) / cost of products sold)',
                                  'Attr53': 'equity / fixed assets',
                                  'Attr54': 'constant capital / fixed assets',
                                  'Attr55': 'working capital',
                                  'Attr56': '(sales - cost of products sold) / sales',
                                  'Attr57': '(current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation)',
                                  'Attr58': 'total costs /total sales',
                                  'Attr59': 'long-term liabilities / equity',
                                  'Attr60': 'sales / inventory',
                                  'Attr61': 'sales / receivables',
                                  'Attr62': '(short-term liabilities *365) / sales',
                                  'Attr63': 'sales / short-term liabilities',
                                  'Attr64': 'sales / fixed assets'})
In [5]:
my_data.head()
Out[5]:
net profit / total assets total liabilities / total assets working capital / total assets current assets / short-term liabilities [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365 retained earnings / total assets EBIT / total assets book value of equity / total liabilities sales / total assets equity / total assets ... (sales - cost of products sold) / sales (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation) total costs /total sales long-term liabilities / equity sales / inventory sales / receivables (short-term liabilities *365) / sales sales / short-term liabilities sales / fixed assets class
0 0.200550 0.37951 0.39641 2.0472 32.3510 0.38825 0.249760 1.33050 1.1389 0.50494 ... 0.121960 0.39718 0.87804 0.001924 8.4160 5.1372 82.658 4.4158 7.4277 0
1 0.209120 0.49988 0.47225 1.9447 14.7860 0.00000 0.258340 0.99601 1.6996 0.49788 ... 0.121300 0.42002 0.85300 0.000000 4.1486 3.2732 107.350 3.4000 60.9870 0
2 0.248660 0.69592 0.26713 1.5548 -1.1523 0.00000 0.309060 0.43695 1.3090 0.30408 ... 0.241140 0.81774 0.76599 0.694840 4.9909 3.9510 134.270 2.7185 5.2078 0
3 0.081483 0.30734 0.45879 2.4928 51.9520 0.14988 0.092704 1.86610 1.0571 0.57353 ... 0.054015 0.14207 0.94598 0.000000 4.5746 3.6147 86.435 4.2228 5.5497 0
4 0.187320 0.61323 0.22960 1.4063 -7.3128 0.18732 0.187320 0.63070 1.1559 0.38677 ... 0.134850 0.48431 0.86515 0.124440 6.3985 4.3158 127.210 2.8692 7.8980 0

5 rows × 65 columns

Displaying the summary statistics of our dataset. Of note, the attributes take both positive and negative values; as this is financial ratio data, we need to be careful not to interpret these values as absolute magnitudes.

In [6]:
my_data.describe()
Out[6]:
net profit / total assets total liabilities / total assets working capital / total assets current assets / short-term liabilities [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365 retained earnings / total assets EBIT / total assets book value of equity / total liabilities sales / total assets equity / total assets ... working capital (sales - cost of products sold) / sales (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation) total costs /total sales long-term liabilities / equity sales / inventory sales / receivables (short-term liabilities *365) / sales sales / short-term liabilities sales / fixed assets
count 43397.000000 43397.000000 43397.000000 43271.000000 4.331600e+04 43397.000000 43397.000000 43311.000000 43396.000000 43397.000000 ... 4.340400e+04 4.327800e+04 43398.000000 4.332100e+04 43398.000000 4.125300e+04 43303.000000 4.327800e+04 43271.000000 42593.000000
mean 0.035160 0.590212 0.114431 6.314702 -3.853466e+02 -0.056107 0.093478 12.640779 2.652166 0.626868 ... 7.672188e+03 -2.621959e+01 -0.010510 3.002644e+01 1.333288 4.480858e+02 17.033202 1.502328e+03 9.343074 72.788592
std 2.994109 5.842748 5.439429 295.434425 6.124303e+04 7.201326 5.713075 505.894281 62.932732 14.670597 ... 7.005310e+04 5.327862e+03 13.674072 5.334454e+03 122.104445 3.234560e+04 553.049406 1.392667e+05 124.177354 2369.339482
min -463.890000 -430.870000 -479.960000 -0.403110 -1.190300e+07 -508.410000 -517.480000 -141.410000 -3.496000 -479.910000 ... -1.805200e+06 -1.108300e+06 -1667.300000 -1.986900e+02 -327.970000 -1.244000e+01 -12.656000 -2.336500e+06 -1.543200 -10677.000000
25% 0.003429 0.268980 0.021521 1.049500 -4.908000e+01 0.000000 0.005776 0.430275 1.018500 0.295470 ... 2.755425e+01 9.348500e-03 0.014649 8.753200e-01 0.000000 5.545500e+00 4.510150 4.214400e+01 3.097650 2.176800
50% 0.049660 0.471900 0.196610 1.569800 -1.034500e+00 0.000000 0.059634 1.070400 1.195350 0.505970 ... 1.088350e+03 5.294300e-02 0.119670 9.509600e-01 0.006366 9.791700e+00 6.636300 7.132600e+01 5.087600 4.282500
75% 0.129580 0.688320 0.403390 2.787450 5.063425e+01 0.089446 0.150880 2.615700 2.062500 0.709100 ... 4.993325e+03 1.290975e-01 0.284605 9.926400e-01 0.236052 2.018100e+01 10.394500 1.172200e+02 8.598850 9.776200
max 94.280000 480.960000 28.336000 53433.000000 1.250100e+06 543.250000 649.230000 53432.000000 9742.300000 1099.500000 ... 6.123700e+06 2.931500e+02 552.640000 1.108300e+06 23853.000000 4.818700e+06 108000.000000 2.501600e+07 23454.000000 294770.000000

8 rows × 64 columns

Identification and imputation of Null values in the data

In [7]:
colms = np.where(my_data.isnull().sum() > 0)
print(colms)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], dtype=int64),)
In [8]:
my_data = my_data.fillna(0)
In [9]:
colms = np.where(my_data.isnull().sum() > 0)
print(colms)
(array([], dtype=int64),)

To understand the characteristics of each class, we assessed mean attribute values for bankrupt vs. non-bankrupt companies. There appears to be a clear distinction in some of the attribute values between our target classes. While building the model, we will explore which features are most influential in predicting bankrupt companies.

In [10]:
my_data.groupby(['class']).mean()
Out[10]:
net profit / total assets total liabilities / total assets working capital / total assets current assets / short-term liabilities [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365 retained earnings / total assets EBIT / total assets book value of equity / total liabilities sales / total assets equity / total assets ... working capital (sales - cost of products sold) / sales (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation) total costs /total sales long-term liabilities / equity sales / inventory sales / receivables (short-term liabilities *365) / sales sales / short-term liabilities sales / fixed assets
class
0 0.053098 0.543811 0.157384 6.404580 -366.252015 0.000040 0.113920 12.918564 2.688207 0.676952 ... 8020.298733 -27.279235 0.048536 31.429929 1.369922 441.145427 17.080733 1501.643189 9.337911 69.770166
1 -0.319401 1.504741 -0.734674 4.134217 -746.216002 -1.165242 -0.310769 6.584036 1.928654 -0.365093 ... 790.534411 -3.690713 -1.177105 1.090152 0.605021 124.056554 15.263197 1424.608922 8.846351 104.160628

2 rows × 64 columns

We explore the initial data further by isolating some features we felt would be important to the model and visualizing them with boxplots. Notice there is clear separation between class 0 and class 1 for all three features. For profit on operating activities / financial expenses, most non-bankrupt companies fall between 0 and 5, whereas the bankrupt companies are concentrated between 0 and 1. The other two features show clear positive/negative separation.

In [11]:
sns.boxplot(x='class', y='profit on operating activities / financial expenses', data=my_data).set(title='Boxplot for profit on operating activities / financial expenses')
Out[11]:
[Text(0.5, 1.0, 'Boxplot for profit on operating activities / financial expenses')]
In [12]:
sns.boxplot(x='class', y='(gross profit + depreciation) / total liabilities', data=my_data).set(title='Boxplot for (gross profit + depreciation) / total liabilities')
Out[12]:
[Text(0.5, 1.0, 'Boxplot for (gross profit + depreciation) / total liabilities')]
In [13]:
sns.boxplot(x='class', y='gross profit (in 3 years) / total assets', data=my_data).set(title='Boxplot for gross profit (in 3 years) / total assets')
Out[13]:
[Text(0.5, 1.0, 'Boxplot for gross profit (in 3 years) / total assets')]

Upon investigating the data, we noticed that many of the variables share similar components. Features that share components are likely to be highly correlated, which could be detrimental to the model. With that in mind, we took 10 of our most important features and plotted their correlations with each other. We see correlation between (net profit + depreciation) / total liabilities and (gross profit + depreciation) / total liabilities, which makes sense. Since this was the only such pair, we did not find it necessary to remove any highly correlated features from the model.

In [14]:
my_data_top = my_data[['profit on operating activities / financial expenses', '(gross profit + depreciation) / total liabilities',
                      'gross profit (in 3 years) / total assets','profit on sales / sales','(current assets - inventory) / short-term liabilities',
                      '(net profit + depreciation) / total liabilities','(gross profit + depreciation) / sales','retained earnings / total assets',
                      'total liabilities / ((profit on operating activities + depreciation) * (12/365))','profit on operating activities / sales']]
In [15]:
data_corr = my_data_top.corr()  # grabs correlation variables of features
mask = np.zeros_like(data_corr, dtype=bool)  # returns array of zeros w/ same shape and type of given array
mask[np.triu_indices_from(mask)]= True  # Generate a mask for the upper triangle

f, ax = plt.subplots(figsize=(11, 9))  # Matplotlib figure setup / formats nicely
sns.heatmap(data_corr,
            mask = mask,
            square = True,
            cmap = 'coolwarm', # Easier visualization of correlated variables
            annot = True,
            annot_kws = {'size': 12})
ax.set(title='Heatmap of Features with Significant Importance to our Models')
Out[15]:
[Text(0.5, 1.0, 'Heatmap of Features with Significant Importance to our Models')]

Modeling Preparation

We will build both a Random Forest model and an XGBoost model. The data will be down-sampled due to the imbalance in the target class. Our assumption is that correctly identifying potentially bankrupt companies matters more than overall accuracy across both classes, so we want a model that does not have to contend with the imbalance.

We will report model performance with the accuracy metric and evaluate the effectiveness of the models using precision and recall, combined with the weighted F1 score, which balances the two.
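
As a quick illustration of why accuracy alone is insufficient here, the toy example below uses hypothetical labels (not our data) to show how a classifier can score high accuracy on an imbalanced label set while missing much of the minority class:

In [ ]:
# Illustrative only: hypothetical labels, not our data.
# Accuracy looks strong while recall on the minority (bankrupt) class is poor,
# which is why we also track precision, recall, and weighted F1.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true_toy = [0] * 18 + [1] * 2   # 18 healthy companies, 2 bankrupt
y_pred_toy = [0] * 19 + [1] * 1   # misses one of the two bankrupt companies

print("accuracy:   ", accuracy_score(y_true_toy, y_pred_toy))    # 0.95
print("precision:  ", precision_score(y_true_toy, y_pred_toy))   # 1.00
print("recall:     ", recall_score(y_true_toy, y_pred_toy))      # 0.50
print("weighted F1:", f1_score(y_true_toy, y_pred_toy, average='weighted'))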

Denoting the X and y

In [16]:
X = my_data.drop("class" , axis=1)
y = my_data['class']

Due to the imbalance in the target class, we have elected to use a down-sampling method in order to increase precision and recall while taking only a small hit to accuracy.

In [17]:
under_sample = RandomUnderSampler(sampling_strategy='all')
X_under, y_under = under_sample.fit_resample(X,y)

Our sampling methodology is to perform a 70/30 train/test split on the merged dataframe that contains all of the yearly datasets.

In [18]:
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under, test_size=.30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)

We scaled the data using a standard scaler fit on the training dataset. The under-sampled data will be used for both the XGBoost and Random Forest models.

In [19]:
# Fit one scaler per split so the under-sampled pipeline does not reuse (and refit)
# the scaler that was fit on the full training data.
scaler = StandardScaler()
scaler_u = StandardScaler()

# Full (imbalanced) split: fit on the training set only, then transform train and test
X_train_scaler = scaler.fit(X_train)
X_train_scaled = pd.DataFrame(data=X_train_scaler.transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(data=X_train_scaler.transform(X_test), columns=X_test.columns)

# Under-sampled split: fit on its own training set, then transform its train and test
X_train_scaler_u = scaler_u.fit(X_train_under)
X_train_scaled_u = pd.DataFrame(data=X_train_scaler_u.transform(X_train_under), columns=X_train_under.columns)
X_test_scaled_u = pd.DataFrame(data=X_train_scaler_u.transform(X_test_under), columns=X_test_under.columns)

Model Building & Evaluation

Random Forest

Random Forest is a tree-based ensemble method which, using a technique called bagging, combines the predictions of many decision trees into a single result. It helps prevent overfitting by training each tree on a random bootstrap sample of the data and a random subset of the features.

Building a random forest model with 100 estimators and the 'entropy' criterion, based on the results of a grid search (not shown here; a sketch follows).
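
For reference, a minimal sketch of the kind of grid search we ran to select these hyperparameters; the parameter grid shown is illustrative, and the original search output is not reproduced here.

In [ ]:
# Sketch of the hyperparameter search (illustrative grid; original search not shown).
rf_grid = {'n_estimators': [50, 100, 200],
           'criterion': ['gini', 'entropy']}
rf_search = GridSearchCV(RandomForestClassifier(random_state=1),
                         rf_grid,
                         scoring='f1_weighted',
                         cv=5)
rf_search.fit(X_train_scaled_u, y_train_under)
print(rf_search.best_params_)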

In [29]:
dTree = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state=1)
dTree = dTree.fit(X_train_scaled_u, y_train_under)
y_pred = dTree.predict(X_test_under)
accuracy_score(y_pred,y_test_under)
Out[29]:
0.6717131474103586

The overall model accuracy is not extremely high; however, the supporting performance metrics of weighted F1 and class-level precision and recall emphasize the minority class, which is the focus of our analysis, and tell a more useful story.

The confusion matrix displays the distribution of classifications for the Random Forest model. With a recall score of 0.59 for the bankrupt class, around 41% of the bankrupt companies were misclassified as not bankrupt, while 59% were correctly predicted as bankrupt.

In [30]:
cnf_matrix = confusion_matrix(y_test_under, y_pred)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[30]:
Text(0.5, 352.48, 'Predicted label')

The confusion matrix and report below show reasonable precision and recall for the bankrupt class, compared with the results of a random forest trained on the original imbalanced data without down-sampling (a sketch of that baseline follows the report).

In [31]:
print(classification_report(y_test_under, y_pred))
              precision    recall  f1-score   support

           0       0.66      0.75      0.70       641
           1       0.69      0.59      0.64       614

    accuracy                           0.67      1255
   macro avg       0.67      0.67      0.67      1255
weighted avg       0.67      0.67      0.67      1255
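
For context, the sketch below trains the same random forest on the original imbalanced split (without down-sampling) referenced above; its exact scores are not reproduced here.

In [ ]:
# Baseline sketch: the same random forest trained on the original imbalanced split
# (scores not reproduced here; shown only to make the comparison above concrete).
rf_imbalanced = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=1)
rf_imbalanced.fit(X_train_scaled, y_train)
print(classification_report(y_test, rf_imbalanced.predict(X_test_scaled)))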

XGBoost

XGBoost is also a tree-based ensemble method, but rather than using bagging like Random Forest, it boosts: weak models are built sequentially, with each new model fit to the errors of the ones before it in order to minimize overall error. As noted above, Random Forest mainly guards against overfitting, whereas boosting works to reduce underfitting and bias. And while the random forest trees can be built in parallel, the boosting algorithm operates in sequence, making it more computationally expensive.

Building the boosting model (implemented with scikit-learn's GradientBoostingClassifier) with 35 estimators and the 'mae' criterion, as determined by a grid search (not shown here; a sketch follows).
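
As with the random forest, a minimal sketch of the grid search that produced these settings; the grid shown is illustrative, and the original search output is not reproduced here.

In [ ]:
# Sketch of the boosting hyperparameter search (illustrative grid; original search not shown).
gb_grid = {'n_estimators': [25, 35, 50],
           'criterion': ['friedman_mse', 'mae']}
gb_search = GridSearchCV(GradientBoostingClassifier(random_state=1999),
                         gb_grid,
                         scoring='f1_weighted',
                         cv=5)
gb_search.fit(X_train_scaled_u, y_train_under)
print(gb_search.best_params_)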

In [45]:
abcl = GradientBoostingClassifier(n_estimators=35,criterion = 'mae', random_state=1999)
abcl = abcl.fit(X_train_scaled_u, y_train_under)
y_pred_xgb = abcl.predict(X_test_under)
accuracy_score(y_pred_xgb, y_test_under)
Out[45]:
0.6231075697211156

This model's accuracy is about 5 percentage points lower than the random forest model's.

The confusion matrix displays the distribution of classifications for the XGBoost model. With a recall score of 0.36 for the bankrupt class, around 64% of the bankrupt companies were misclassified as not bankrupt, while only 36% were correctly predicted as bankrupt. This is a significant decrease in performance from the random forest model.

In [46]:
cnf_matrix = confusion_matrix(y_test_under, y_pred_xgb)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[46]:
Text(0.5, 352.48, 'Predicted label')
In [47]:
print(classification_report(y_test_under, y_pred_xgb))
              precision    recall  f1-score   support

           0       0.59      0.87      0.70       641
           1       0.73      0.36      0.48       614

    accuracy                           0.62      1255
   macro avg       0.66      0.62      0.59      1255
weighted avg       0.66      0.62      0.60      1255

Below are the top 10 most important features for our Random Forest model. Several of the features have total assets or total liabilities as a denominator. This makes sense at a macro view: since we are trying to identify companies at risk of going bankrupt, it is logical that assets and liabilities are key figures in that determination.

Investors are willing to overlook raw sales / cash numbers as long as the asset / liability ratio is in good health.

In [51]:
viz = FeatureImportances(dTree, relative=False,topn=10)
viz.fit(X_train_under, y_train_under)
Out[51]:
FeatureImportances(ax=<AxesSubplot:>,
                   estimator=RandomForestClassifier(criterion='entropy',
                                                    random_state=1),
                   relative=False, topn=10)

Case Conclusion

After exploring the original, SMOTE (over-sampled), and down-sampled data with both Random Forest and XGBoost models, the team concluded that the down-sampled approach with a random forest algorithm did the best job of predicting which companies will go bankrupt.
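
For completeness, a minimal sketch of the SMOTE (over-sampled) variant referenced above, applied to the original training split; its scores from our comparison are not reproduced here.

In [ ]:
# Sketch of the SMOTE comparison mentioned above (illustrative; results not reproduced here).
smote = SMOTE(random_state=1)
X_smote, y_smote = smote.fit_resample(X_train, y_train)   # over-sample the minority class
rf_smote = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=1)
rf_smote.fit(X_smote, y_smote)
print(classification_report(y_test, rf_smote.predict(X_test)))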

There is a trade-off in model performance: optimizing bankruptcy identification comes at the cost of falsely flagging healthy companies as potential bankruptcies. A business decision should be made to determine the level of false positives the bank is willing to tolerate in order to minimize bankruptcy exposure risk.

Given the complexity of separating the classes, more sophisticated classification models should be evaluated to improve the ability to predict future bankruptcies. Further research should also leverage the leading indicators the model highlighted as significant in order to better protect investment returns.
